Enhancing pediatric pneumonia diagnosis through masked autoencoders

Pneumonia, an inflammatory lung condition primarily triggered by bacteria, viruses, or fungi, presents distinctive challenges in pediatric cases due to the unique characteristics of the respiratory system and the potential for rapid deterioration. Timely diagnosis is crucial, particularly in children under 5, whose immature immune systems make them more susceptible to pneumonia. While chest X-rays are indispensable for diagnosis, challenges arise from subtle radiographic findings, varied clinical presentations, and the subjectivity of interpretations, especially in pediatric cases. Deep learning, particularly transfer learning, has shown promise in improving pneumonia diagnosis by leveraging large labeled datasets. However, the scarcity of labeled data for pediatric chest X-rays presents a hurdle in effective model training. To address this challenge, we explore the potential of self-supervised learning, focusing on the Masked Autoencoder (MAE). By pretraining the MAE model on adult chest X-ray images and fine-tuning the pretrained model on a pediatric pneumonia chest X-ray dataset, we aim to overcome data scarcity issues and enhance diagnostic accuracy for pediatric pneumonia. The proposed approach demonstrated competitive performance, with an AUC of 0.996 and an accuracy of 95.89% in distinguishing between normal and pneumonia cases. Additionally, the approach exhibited high AUC values (normal: 0.997, bacterial pneumonia: 0.983, viral pneumonia: 0.956) and an accuracy of 93.86% in classifying normal, bacterial pneumonia, and viral pneumonia. This study also investigated the impact of different masking ratios during pretraining and explored the labeled data efficiency of the MAE model, presenting enhanced diagnostic capabilities for pediatric pneumonia.


Results
In this study, we conducted various experiments to demonstrate the effectiveness of pretraining the MAE model on adult chest X-ray images for diagnosing pediatric pneumonia. The adult chest X-ray images used to pretrain the MAE model were obtained from the publicly available ChestX-ray14 and CheXpert datasets 22,23. To demonstrate the effectiveness of utilizing MAE for pretraining, we considered three methods to pretrain the backbone model. The first method involved training ResNet-34 and vision transformer (ViT) models without pretraining, directly on pediatric pneumonia data with random initial weights. The second method entailed fine-tuning ResNet-34 and ViT models pretrained on the ImageNet dataset with pediatric pneumonia data. The third method employed the MAE architecture, pretraining it with the ImageNet dataset and, as the proposed approach, with adult chest X-ray images.
Table 1 presents details about the datasets employed in this study. The pretraining datasets, ChestX-ray14 and CheXpert, consist of a total of 303,349 images. The fine-tuning dataset is the pediatric pneumonia chest X-ray dataset, consisting of 5232 training images and 624 test images 24. Both training and test images are categorized into normal and pneumonia data, with the pneumonia data further subcategorized into bacterial and viral types.
Table 2 and Fig. 1 present a comprehensive comparison of different pretraining methods and backbone models applied to pediatric pneumonia diagnosis. The backbone models, ResNet-34 and ViT-S, were selected for their comparable parameter counts of approximately 22 million each. Models trained from scratch on pediatric data exhibited notable differences. The ResNet-34 (Random) model outperformed the ViT-S (Random) model, achieving an AUC of 0.977 and an accuracy of 90.38%, compared to the ViT-S (Random) model's AUC of 0.902 and accuracy of 83.55%. This discrepancy may be attributed to the limited training data relative to ViT-S's model size and its lack of inductive bias. Leveraging pretrained weights from the ImageNet dataset improved the diagnostic capabilities of both ResNet-34 (ImageNet) and ViT-S (ImageNet). ResNet-34 (ImageNet) achieved an AUC of 0.990 and an accuracy of 93.70%, while ViT-S (ImageNet) obtained an AUC of 0.992 and an accuracy of 91.45%. The MAE (ImageNet) model, representing the MAE architecture pretrained on the ImageNet dataset, demonstrated competitive diagnostic metrics with an AUC of 0.989 and an accuracy of 92.63%. Interestingly, all three models pretrained on ImageNet exhibited similar performance metrics. The MAE (Adult) model, pretrained with adult chest X-ray images combined from the ChestX-ray14 and CheXpert datasets, demonstrated outstanding results. Despite the significantly smaller number of adult chest X-ray images compared to the ImageNet dataset, MAE (Adult) exhibited exceptional performance, with an AUC of 0.996 and an accuracy of 95.89%. Furthermore, the model demonstrated high sensitivity (0.996), precision (0.942), and F1-score (0.968), indicating its effectiveness in pediatric pneumonia diagnosis. The p values in Table 2 indicate the overall difference between MAE (Adult) and the other methods. When comparing MAE (Adult) to the other methods in terms of AUC, accuracy, precision, and F1-score, all p values are less than 0.05, indicating statistical significance. For sensitivity, comparing MAE (Adult) to ResNet-34 (Random) and ViT-S (Random) yielded p values of 0.0290 and < 0.0001, respectively, indicating significant differences. On the other hand, when comparing MAE (Adult) to ResNet-34 (ImageNet), ViT-S (ImageNet), and MAE (ImageNet), the sensitivity p values were 0.7658, 0.9871, and 0.5685, respectively, suggesting no significant differences.
Table 3 presents the performance metrics of the MAE (Adult) model under different masking ratios. The masking ratio denotes the proportion of masked patches during pretraining. As the masking ratio increased from 0.65 to 0.95, the AUC remained consistently high at 0.996, indicating robust diagnostic capabilities. Accuracy also remained high, ranging from 94.07 to 95.89%, demonstrating the model's ability to correctly classify pediatric pneumonia cases. Sensitivity, precision, and F1-score exhibited similar trends across the masking ratios, with only slight variations. Notably, the model achieved its best performance at a masking ratio of 0.95. These results indicate that the MAE (Adult) model performed effectively across the tested masking ratios, emphasizing its robustness and reliability in pediatric pneumonia diagnosis.
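The sensitivity, precision, and F1-score values reported above are tied together by the usual definition of the F1-score as the harmonic mean of precision and sensitivity. The short sketch below checks that relationship for the MAE (Adult) values quoted in the text; the function name is illustrative and not taken from the study's code.

```python
def f1_from_precision_and_sensitivity(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Values quoted in the text for MAE (Adult): precision 0.942, sensitivity 0.996.
mae_adult_f1 = f1_from_precision_and_sensitivity(0.942, 0.996)
print(round(mae_adult_f1, 3))  # 0.968, matching the reported F1-score
```

Because the harmonic mean is dominated by the smaller of its two inputs, the F1-score here stays closer to the precision (0.942) than to the very high sensitivity (0.996).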
Figure 2 illustrates the performance metrics of the MAE (Adult) model at a masking ratio of 0.95 for different fractions of labeled pediatric data. The fraction of labeled data indicates the proportion of labeled pediatric data used during the fine-tuning process. As the fraction of labeled data increased from 10 to 100%, the model's performance metrics exhibited notable improvements. The AUC rose steadily from 0.942 to 0.996, showing enhanced diagnostic capabilities. Accuracy followed a similar upward trend, ranging from 0.836 to 0.959 (note: the accuracy values in Fig. 2 are represented as values between 0 and 1, not as percentages). Sensitivity, precision, and F1-score also improved consistently with increasing fractions of labeled data. Overall, the results emphasized the positive impact of increasing labeled pediatric data on the model's performance, underscoring the importance of data quantity in the fine-tuning process.
Table 4 provides a comprehensive overview of the performance metrics for classifying pediatric chest X-rays into normal, bacterial pneumonia, and viral pneumonia categories using various pretraining data and backbone models. ResNet-34 (Random) achieved notable AUC values for normal (0.984), bacterial pneumonia (0.973), and viral pneumonia (0.923), with an accuracy of 90.88%. ViT-S (Random) showed inferior AUC values and a lower accuracy (84.47%) compared to ResNet-34 (Random), suggesting potential limitations in capturing subtle pediatric pneumonia patterns. ResNet-34 (ImageNet) demonstrated competitive performance with AUC values across all categories (normal: 0.994, bacterial pneumonia: 0.982, viral pneumonia: 0.942) and an accuracy of 92.61%. ViT-S (ImageNet) and MAE (ImageNet) exhibited similar performance, highlighting the advantages of pretraining on ImageNet. MAE (Adult) demonstrated exceptional performance with a masking ratio of 0.65, exhibiting high AUC values for all three classes.

Discussion
In this study, we performed several experiments to evaluate the effectiveness of MAE in diagnosing pediatric pneumonia. This was accomplished by pretraining on adult chest X-ray datasets and subsequently fine-tuning on a pediatric pneumonia chest X-ray dataset. The experiments yielded valuable insights into the application of MAE in medical imaging, highlighting the following key findings. First and foremost, pretraining with adult chest X-ray images using MAE led to a significant improvement in pediatric pneumonia diagnosis compared to pretraining with ImageNet or without any pretraining. These results can be attributed to considerations from both architectural and data perspectives. Architecturally, the MAE encoder is designed to extract global representations from partial observations, while the decoder learns local representations by reconstructing missing pixels. This architecture allows MAE to effectively capture both local and global features of chest X-ray images, resulting in outstanding performance after fine-tuning. From a data viewpoint, a crucial factor is the minimal domain discrepancy between the adult chest X-ray data used for pretraining and the pediatric chest X-ray data used for fine-tuning. Recent research suggests that out-of-domain pretraining may not be beneficial for network initialization, possibly due to domain shift. Optimal performance can be achieved through pretraining and fine-tuning under in-domain transfer 21,25. Although adult and pediatric chest X-ray images are not an exact in-domain match, the substantial similarities in their structural and textural features appear to contribute to the strong performance observed in this study.
Secondly, the performance of the MAE model on chest X-ray images was influenced by the masking ratio applied during pretraining. Determining the optimal masking ratio involves considerations of information redundancy in the data. Notably, while BERT employed a 15% masking ratio for language tasks, MAE utilized a 75% masking ratio for image-related tasks 17,26. Due to the substantial similarity in chest anatomy, chest X-rays inherently exhibit higher information redundancy than natural images. Consequently, a 90% masking ratio has been favored for MAE on chest X-rays 21. In our experiments, we systematically varied the masking ratio from 65 to 95% in increments of 5%. As detailed in Tables 3 and 4, the optimal masking ratio for classifying normal and pediatric pneumonia was found to be 95%, while for classifying normal, bacterial pneumonia, and viral pneumonia, the optimal masking ratio was 65%. Despite the task-specific variations in optimal masking ratios, higher masking ratios allowed efficient learning of the ViT encoder, which processes only the small number of visible (unmasked) patches. This contributed to reduced time and memory complexities in the learning process.
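To make the efficiency argument concrete, the sketch below counts how many patch tokens the encoder would see at several masking ratios for a 224 × 224 input split into 16 × 16 patches, the sizes given in the pretraining description. Truncating the number of kept tokens toward zero is an assumption about the implementation, and the last column relies on the general fact that self-attention compares every token with every other token, so its pairwise cost grows with the square of the token count.

```python
def visible_token_count(masking_ratio, image_size=224, patch_size=16):
    """Return (total patches, patches left visible to the encoder)."""
    num_patches = (image_size // patch_size) ** 2
    # Assumption: the number of kept tokens is truncated toward zero.
    num_visible = int(num_patches * (1 - masking_ratio))
    return num_patches, num_visible


for ratio in (0.65, 0.75, 0.90, 0.95):
    total, visible = visible_token_count(ratio)
    # Fraction of token pairs attended to, relative to using every patch.
    pairwise_fraction = (visible / total) ** 2
    print(f"ratio={ratio:.2f}  visible={visible}/{total}  "
          f"pairwise attention vs. full input={pairwise_fraction:.4f}")
```

Under these assumptions the encoder sees 68 of 196 tokens at a ratio of 0.65 and only 9 at 0.95, which is where the reduction in time and memory comes from.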
Lastly, this study demonstrates that MAE operates as a labeled data-efficient model, maintaining comparable performance even when trained on a partial dataset. To illustrate this experimentally, we randomly partitioned the pediatric chest X-ray dataset into subsets of 10%, 20%, 50%, and 100%. As depicted in Fig. 2, the model's performance improved with an increasing percentage of labeled pediatric data. These findings suggest that pretraining with adult chest X-ray images using MAE is beneficial for fine-tuning models with limited pediatric chest X-ray images. Typically, training large deep learning models with limited data is challenging and risks overfitting because of the large number of model parameters. However, pretraining with MAE enables the model to establish a robust representation of the target dataset, effectively mitigating these challenges. This finding emphasizes the potential applicability of MAE in settings where obtaining large quantities of labeled data is challenging, such as clinical environments. Furthermore, using a smaller labeled dataset to train a large deep learning model reduces labeling costs, a significant benefit given that conventional supervised deep learning requires a large amount of labeled data.
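The labeled-fraction experiment described above can be sketched as a random subsampling of image indices. This is a minimal illustration rather than the study's procedure: the fixed seed, the use of all 5232 training images as the pool, and the choice to make smaller subsets nested inside larger ones are all assumptions.

```python
import random


def sample_labeled_subset(num_images, fraction, seed=0):
    """Return a sorted list of image indices covering `fraction` of the pool.

    The pool is shuffled once with a fixed seed and a prefix is taken, so
    subsets drawn with the same seed are nested (10% inside 20%, and so on).
    """
    order = list(range(num_images))
    random.Random(seed).shuffle(order)
    subset_size = round(num_images * fraction)
    return sorted(order[:subset_size])


# Hypothetical pool: the 5232 images of the official pediatric training split.
subsets = {f: sample_labeled_subset(5232, f) for f in (0.1, 0.2, 0.5, 1.0)}
for fraction, indices in subsets.items():
    print(f"{int(fraction * 100):3d}% -> {len(indices)} labeled images")
```

Nesting the subsets keeps the comparison between fractions from being confounded by which images happen to be drawn, although independent draws would be an equally valid reading of "randomly partitioned".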
In this study, the pediatric pneumonia chest X-ray data used for fine-tuning has an imbalanced class distribution, with 1349 images in the normal category and 3883 images in the pneumonia category. The results discussed earlier were obtained without addressing this imbalance. The pneumonia classification results that account for the class imbalance are provided in Table S1 of the supplementary materials. To address the imbalance, a weighted loss function was used in place of the standard loss function during the fine-tuning phase. Specifically, the class weights were set inversely proportional to the class frequencies, so that the minority class (normal category) received a higher weight and the majority class (pneumonia category) received a lower weight. Table S1 and Table 2 show similar performance values and trends across the pretraining datasets and backbone models, with comparable levels of AUC, accuracy, sensitivity, precision, and F1-score. Notably, the reported MAE (Adult) performance values in both tables are almost identical, with standard deviations falling within a similar range. This suggests that the models perform similarly with and without the weighted loss.
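The inverse-frequency weighting described above can be sketched as follows, using the class counts given in the text. The text states only that the weights are inversely proportional to class frequency, so the particular scaling (total divided by the number of classes times the class count) and the normalization of the loss by the summed weights are assumptions made for illustration.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {name: total / (num_classes * count)
            for name, count in class_counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Class-weighted cross-entropy, averaged over the summed sample weights.

    probabilities: one dict per sample mapping class name to predicted probability.
    labels: the true class name of each sample.
    """
    weighted_losses = [-weights[label] * math.log(probs[label])
                       for probs, label in zip(probabilities, labels)]
    normalizer = sum(weights[label] for label in labels)
    return sum(weighted_losses) / normalizer


# Class counts of the pediatric training set as stated in the text.
counts = {"normal": 1349, "pneumonia": 3883}
weights = inverse_frequency_weights(counts)
print(weights)  # the minority class (normal) receives the larger weight
```

With this scaling, an error on a normal image contributes roughly 2.9 times as much to the loss as the same error on a pneumonia image, which is the ratio of the two class counts.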
To demonstrate the effectiveness of MAE (Adult), we compared its performance with prominent self-supervised learning methods, namely MOCO-v2 and BYOL 15,16. The training strategy for both MOCO-v2 and BYOL was identical to that of MAE (Adult): pretraining with adult chest X-rays, which have shapes and textures similar to pediatric chest X-rays and therefore minimal domain discrepancy, followed by fine-tuning with pediatric chest X-rays. The results obtained from MOCO-v2 and BYOL are included in Table S2 of the supplementary materials. MOCO-v2 achieved AUC, accuracy, sensitivity, precision, and F1-score values of 0.973, 90.87%, 0.988, 0.881, and 0.931, respectively. Compared to MOCO-v2, MAE (Adult) demonstrates higher values across all of these performance metrics. As for BYOL, its AUC, accuracy, sensitivity, precision, and F1-score values are 0.988, 96.05%, 0.998, 0.942, and 0.969, respectively. In comparison to BYOL, MAE (Adult) exhibits a higher AUC, while their precision values are identical. Additionally, although BYOL demonstrates slightly better accuracy, sensitivity, and F1-score values than MAE (Adult), the differences are minimal. MOCO-v2, on the other hand, exhibits lower overall performance than BYOL and MAE (Adult), possibly due to weak augmentation. Like SimCLR, MOCO-v2 requires strong augmentation 14,15. However, in this study, MOCO-v2 employed weak augmentation to mitigate potential risks, such as cropping out or introducing bias to critical elements like informative lesions or organs in medical images. This likely explains why it performed less effectively than the other self-supervised learning methods. Although our comparison is limited to a few self-supervised learning approaches, we believe these results sufficiently demonstrate the utility of MAE (Adult).
Several studies have investigated pediatric pneumonia diagnosis using the same dataset and the same official split of pediatric pneumonia chest X-rays employed in this study. Table 5 outlines the research methodologies and performance differences between this study and others. Kermany et al. fine-tuned the Inception-v3 model, pretrained on the ImageNet dataset, achieving an AUC of 0.968 and an accuracy of 92.80% 24. Liang and Zheng designed a custom residual network, pretrained on the ChestX-ray14 dataset and fine-tuned with pediatric pneumonia data, resulting in an AUC of 0.953 and an accuracy of 90.50% 5. Ayan et al. employed ensemble learning, selecting the top-performing CNN models (ResNet-50, MobileNet, and Xception) from seven models pretrained on the ImageNet dataset. The ensemble achieved an AUC of 0.917 and an accuracy of 93.26% 27. Mabrouk et al. utilized ensemble learning with DenseNet169, MobileNetV2, and ViT networks pretrained on ImageNet, reaching an accuracy of 93.91% 28. Kiliçarslan et al. proposed a 4-layer CNN with a superior exponential activation function, achieving an accuracy of 95.37% with randomly initialized weights 29. Gazda et al. conducted experiments similar to our study, employing contrastive learning to pretrain the ResNet-50 network with CheXpert data. The pretrained network was then fine-tuned with pediatric pneumonia data, resulting in an AUC of 0.977 and an accuracy of 91.50% 30. Singh et al. fine-tuned the DEIT_B model pretrained on the ImageNet dataset. We replicated their results using the same official data split as ours, yielding an AUC of 0.995 and an accuracy of 94.50% 31. Nisho et al. fine-tuned EfficientNet with the noisy student network pretrained on the ImageNet dataset. However, they conducted experiments using a COVID-19 pneumonia dataset, which differs from our experiment. Reproducing their approach on the pediatric pneumonia chest X-ray dataset using EfficientNet-B4 with the noisy student model, we achieved an AUC of 0.991 and an accuracy of 93.94% 32. Our approach demonstrates superior performance with an AUC of 0.996 and an accuracy of 95.89%, highlighting its effectiveness compared to the referenced studies. This suggests promising potential for the proposed approach in diagnosing pediatric pneumonia with pediatric chest X-ray images.

Conclusion
In this study, we proposed an approach to enhance the diagnosis of pediatric pneumonia using a masked autoencoder. The methodology involved pretraining the ViT model with adult chest X-ray images, followed by fine-tuning with pediatric chest X-ray data. The experimental findings demonstrated that the MAE model, pretrained with adult chest X-ray images, achieved superior performance in diagnosing pediatric pneumonia, with an accuracy of 95.89%, an AUC of 0.996, a sensitivity of 0.996, a precision of 0.942, and an F1-score of 0.968. Moreover, the classification of pediatric chest X-ray data into normal, bacterial pneumonia, and viral pneumonia categories resulted in an accuracy of 93.86%, with individual AUC values of 0.997 for normal, 0.983 for bacterial, and 0.956 for viral. These outcomes, compared to other methods, demonstrate the effectiveness of our proposed approach and its significance in the medical imaging field, particularly in scenarios where labeled data is limited. The promising outcomes suggest broader applicability of our proposed method across diverse medical imaging domains in the future.

Adult chest X-ray datasets and a pediatric chest X-ray pneumonia dataset
We employed two datasets, ChestX-ray14 and CheXpert, for the pretraining of MAE models. The ChestX-ray14 dataset comprises 112,120 frontal view chest X-ray images. Among these, 51,708 exhibit one or more pathologies across 14 pathology classes, while the remaining 60,412 images do not show any signs of disease 22. The CheXpert dataset consists of 224,316 annotated chest radiographs, including both frontal and lateral views.
Table 5. Performance of other recent works on the pediatric pneumonia chest X-ray dataset. † Reproduced results using the same dataset and the official data split of pediatric pneumonia chest X-rays.
The pediatric chest X-ray data used in the study is from the Pediatric Pneumonia Chest X-ray dataset 24. This dataset consists of 5856 pediatric chest X-ray images, which were divided into training and test datasets. Of the total, 5232 images (1349 normal, 3883 pneumonia) were officially assigned to the training set, while the remaining 624 images (234 normal, 390 pneumonia) formed the test set. Eighty percent of the training data was employed for either fine-tuning the pretrained networks or training networks from scratch, while the rest of the training data was used as validation data.

Pretraining with adult chest X-ray images
Recently, MAE has proven to be effective in pretraining vision transformers (ViT) for image classification, detection, and segmentation tasks 17,33,34. MAE, a type of denoising autoencoder, is specifically designed to reconstruct the original signal from partial observations. As illustrated in Fig. 3, MAE features an asymmetric encoder-decoder architecture. The MAE encoder, a variant of ViT, exclusively processes visible, unmasked patches. By reconstructing entire images from partially masked inputs, it aggregates contextual information, allowing for the inference of masked image regions 17. In medical imaging tasks, we assert that contextual information plays a crucial role in reconstructing masked image patches, given the inherent dependence and connection of a region of interest (ROI) with its physiological environment and surroundings.
In this study, we employed the ChestX-ray14 and CheXpert datasets as inputs for the MAE encoder. The adult chest X-ray images were resized to 256 × 256 pixels and standardized using the mean and standard deviation calculated from ImageNet. Following this, we applied random resized cropping (scale range 0.5-1.0) to 224 × 224 pixels and horizontal flipping for dataset augmentation. However, to avoid cropping out or introducing bias to informative lesions or organs, which are crucial in medical images, no additional data augmentation methods were applied. The preprocessed images were partitioned into non-overlapping 16 × 16 patches. Each input patch was converted to a token through linear projection, with an additional positional embedding. Subsequently, a randomly selected subset, ranging from 65 to 95% of these tokens, was masked. As the MAE encoder exclusively deals with the visible, unmasked tokens (5-35%), we achieved efficient pretraining with reduced computational and memory demands. The MAE encoder is encouraged to extract a global representation from partial observations, as its output tokens are used by the MAE decoder to reconstruct the content at the positions of the learnable mask tokens. The MAE decoder processes a full set of tokens by combining encoded visible tokens with learnable mask tokens. With positional embeddings added to all input tokens, the decoder reconstructs patches at their specific masked positions, and its output is reshaped to generate a reconstructed image. It is important to note that the MAE decoder is employed exclusively during the pretraining process and is intentionally designed to be smaller than the encoder to optimize pretraining efficiency.
The MAE is trained using a reconstruction loss, specifically the mean squared error between reconstructed and original images. Notably, the loss computation is confined to the masked patches. For optimizing the MAE network, we adopted the AdamW optimizer with parameters (β1 = 0.9, β2 = 0.95) and a weight decay of 0.05. The transformer blocks in ViT were initialized using Xavier uniform initialization. The initial learning rate and batch size were set to 1.5e−4 and 128, respectively. The learning rate was warmed up for the initial 20 epochs and then adjusted using a cosine annealing schedule. The MAE pretraining process ran for 800 epochs. The pseudocode for MAE is provided in Table S3 of the supplementary materials.
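The masking step and the masked reconstruction loss described above can be sketched in plain Python. This is an illustrative toy, not the pseudocode of Table S3: patches are represented as flat lists of pixel values, the function names are hypothetical, and truncating the number of kept tokens toward zero is an assumption.

```python
import random


def random_masking(num_patches, masking_ratio, rng):
    """Choose which patches stay visible.

    Returns (keep_ids, mask), where mask[i] is 1 for a masked patch and 0 for
    a visible one. Only the patches in keep_ids would be fed to the encoder.
    """
    num_keep = int(num_patches * (1 - masking_ratio))  # assumed truncation
    order = list(range(num_patches))
    rng.shuffle(order)
    keep_ids = sorted(order[:num_keep])
    mask = [1] * num_patches
    for index in keep_ids:
        mask[index] = 0
    return keep_ids, mask


def masked_mse(predicted, target, mask):
    """Mean squared error averaged over masked patches only."""
    total = 0.0
    for pred_patch, true_patch, is_masked in zip(predicted, target, mask):
        if is_masked:
            squared = [(p - t) ** 2 for p, t in zip(pred_patch, true_patch)]
            total += sum(squared) / len(squared)
    return total / sum(mask)


# A 224 x 224 image in 16 x 16 patches gives 14 * 14 = 196 patches.
rng = random.Random(0)
keep_ids, mask = random_masking(196, 0.95, rng)
print(len(keep_ids), "visible patches,", sum(mask), "masked patches")

# Toy single-channel patches: the target is all ones, the prediction all zeros.
target = [[1.0] * 256 for _ in range(196)]
predicted = [[0.0] * 256 for _ in range(196)]
print(masked_mse(predicted, target, mask))
```

Restricting the loss to masked patches means the model is never rewarded for copying the visible input, so the training signal comes entirely from inferring content it did not see.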

Figure 1 .
Performance comparison of MAE (Adult) model at masking ratio 0.95.

Table 1 .
Details on datasets utilized in this study: pretraining with ChestX-ray14 and CheXpert, fine-tuning with Pediatric Pneumonia Chest X-ray.

Table 2 .
Performance metrics for classifying normal and pneumonia using various pretraining data and backbone models. p denotes the p values for comparing MAE (Adult) with each of the other methods. Highest performance values are in bold.

Table 3 .
Performance comparison of the MAE (Adult) model at different masking ratios. Highest performance values are in bold.

Table 4 .
Performance metrics for classifying pediatric chest X-ray data into normal, bacterial pneumonia, and viral pneumonia categories using different pretraining data and backbone models. Highest performance values are in bold.
At a masking ratio of 0.65, MAE (Adult) exhibited high AUC values (normal: 0.997, bacterial pneumonia: 0.983, viral pneumonia: 0.956) and achieved an accuracy of 93.86%. Similarly, at a masking ratio of 0.95, MAE (Adult) maintained high AUC values (normal: 0.997, bacterial pneumonia: 0.983, viral pneumonia: 0.952) and attained an accuracy of 93.55%. These results highlight the model's effective capture of pediatric pneumonia patterns and its robust diagnostic capabilities across different masking ratios.
The CheXpert annotations cover 14 observations: 12 pathologies, support devices, and observations with no findings 23. Specifically, we used only frontal view images from the CheXpert dataset, totaling 191,229 images 23. Although both datasets have labels, we chose not to use them. The final number of images used for MAE pretraining is the sum of both datasets, amounting to 303,349 images.
Scientific Reports | (2024) 14:6150 | https://doi.org/10.1038/s41598-024-56819-3